Monte Carlo Analysis (Part 2)

Edward Vytlacil

Last Lecture

  • Generated random draws in R.

  • Approximated \(\mathbb{E}_F[g(X)]\) by \[\frac{1}{B}\sum_{b=1}^B g(X_b), \qquad X_1,\ldots,X_B \stackrel{i.i.d.}{\sim} F.\]

  • Justified by the WLLN when \(\mathbb{E}_F[|g(X)|] < \infty\).

  • MC standard error of order \(1/\sqrt{B}\).
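As a one-screen refresher, a minimal sketch of that recipe (illustrative choices: \(g(x) = x^2\), \(F = N(0,1)\), so the target is \(\mathbb{E}_F[g(X)] = 1\)):

```r
set.seed(1)
B <- 100000
g <- function(x) x^2                # illustrative g; E[X^2] = 1 for X ~ N(0,1)
draws  <- g(rnorm(B))
mc_est <- mean(draws)               # MC approximation of E_F[g(X)]
mc_se  <- sd(draws) / sqrt(B)       # MC standard error, order 1/sqrt(B)
c(estimate = mc_est, mc_se = mc_se)
```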

Today

Use MC simulation to:

  • Evaluate finite-sample properties of estimators and inference procedures;

  • Illustrate and visualize theoretical results for estimators and inference procedures.

Why? Asymptotic theory tells us what happens as \(n \to \infty\). MC simulation tells us how well those approximations work at a given sample size, and makes abstract results concrete.

Today (cont’d)

Why? Asymptotic theory tells us what happens as \(n \to \infty\). MC simulation tells us how well those approximations work at a given sample size, and makes abstract results concrete.

Limitation: MC results are specific to the chosen \(F\) and \(n\). Unlike analytical asymptotic results, which characterize behavior across a class of DGPs, each MC experiment provides evidence for one particular design. To build broader understanding, we explore a range of DGPs and sample sizes.

Theory Guides MC Design

MC simulation is most informative when guided by theory.

  • Theory identifies where problems should arise — MC quantifies how bad they are at specific \((n, F)\).

  • Theory identifies which parameters matter — this guides us on what to vary in the simulation.

Example: The delta method for \(g(\hat{\theta})\) relies on \(g\) being approx. linear over the region where \(\hat{\theta}\) lies with high probability. For \(g(x) = 1/x\), the curvature \(g''(x) = 2/x^3\) is large near zero — so theory predicts the approx. will be poor when \(\theta\) is small relative to the SE of \(\hat{\theta}\). This tells us which DGPs to examine.

Theory Guides MC Design (cont’d)

A good MC study follows a pattern:

  1. Theory identifies the key quantities that govern performance (e.g., \(\mu/(\sigma/\sqrt{n})\) for the delta method applied to \(1/\bar{X}\)).

  2. MC at one DGP confirms the theory and reveals the magnitude of the problem (for given DGP).

  3. MC across DGPs varies the key quantities to map out where asymptotic approximations perform better/worse.

We focus on steps 1–2 for each example, but we note what a fuller MC study would explore.

Setup and Notation

  • Suppose we have \(n\) i.i.d. draws from a distribution \(F\): \[W_1, \ldots, W_n \stackrel{i.i.d.}{\sim} F.\]

  • Here \(W_i\) is the full data vector for observation \(i\) — e.g., \(W_i = (Y_i, X_i)\) in a regression setting.

  • Let \(T_n = T(W_1,\ldots,W_n)\) be a statistic (an estimator, a test statistic, etc.).

Setup and Notation (cont’d)

  • The sampling distribution of \(T_n\) is determined by \(n\) and \(F\). Denote its CDF by: \[G_n(u, F) \;=\; \mathbb{P}_F\!\left[\,T_n \le u\,\right].\]

  • \(G_n(\cdot, F)\) depends on the DGP \(F\) and the sample size \(n\).

  • Asymptotic theory approximates \(G_n(\cdot, F)\) for large \(n\) (e.g., by the CLT).

  • MC simulation approximates \(G_n(\cdot, F)\) directly, for any \(n\) and any \(F\) we can simulate from.

The MC Recipe

To approximate \(G_n(\cdot, F)\) by simulation:

For \(b = 1, \ldots, B\):

  1. Draw a sample \(W_1^{(b)}, \ldots, W_n^{(b)} \stackrel{i.i.d.}{\sim} F\).
  2. Compute the statistic \(T_n^{(b)} = T(W_1^{(b)}, \ldots, W_n^{(b)})\).

Approximate \(G_n(\cdot, F)\) by the empirical distribution of \(T_n^{(1)}, \ldots, T_n^{(B)}\), \[\hat{G}_{n,B}(u, F) \;=\; \frac{1}{B}\sum_{b=1}^B \mathbf{1}\!\left[T_n^{(b)} \le u\right].\]

The MC Recipe (cont’d)

Approximate \(G_n(\cdot, F)\) by the empirical distribution of \(T_n^{(1)}, \ldots, T_n^{(B)}\), \[\hat{G}_{n,B}(u, F) \;=\; \frac{1}{B}\sum_{b=1}^B \mathbf{1}\!\left[T_n^{(b)} \le u\right].\]

  • Any feature of \(G_n(\cdot, F)\) that can be expressed as \(\mathbb{E}_{G_n}[h(T_n)]\) can be approximated by \(\frac{1}{B}\sum_{b=1}^B h(T_n^{(b)})\).
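The recipe can be wrapped in a small generic helper (hypothetical names: `mc_draws`, where `draw_sample` generates one dataset and `statistic` maps it to \(T_n\)):

```r
mc_draws <- function(B, n, draw_sample, statistic) {
  replicate(B, statistic(draw_sample(n)))
}

# Illustration: T_n = sample mean, F = Exponential(1)
set.seed(1)
Tn <- mc_draws(B = 5000, n = 30,
               draw_sample = function(n) rexp(n, rate = 1),
               statistic   = mean)
# The empirical CDF of Tn approximates G_n(u, F); e.g., at u = 1:
mean(Tn <= 1)
```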

The MC Recipe: Key Distinction

  • The researcher chooses \(F\) and \(n\) — these define the DGP.

  • The researcher chooses \(B\) — this controls MC accuracy.

  • \(n\) and \(B\) play very different roles:

    • \(n\) = sample size in each simulated dataset. Changing \(n\) changes the object \(G_n(\cdot, F)\) being studied.
    • \(B\) = number of MC replications. Changing \(B\) changes the precision of our approximation of \(G_n(\cdot, F)\).

Evaluating Estimators

Suppose \(T_n = \hat{\theta}_n\) is an estimator of a parameter \(\theta(F)\).

From the MC draws \(\hat{\theta}_n^{(1)}, \ldots, \hat{\theta}_n^{(B)}\), approximate:

  • Bias: \(\;\mathbb{E}_F[\hat{\theta}_n] - \theta(F) \;\approx\; \bar{\hat{\theta}} - \theta(F)\), \(\quad\) where \(\bar{\hat{\theta}} = \frac{1}{B}\sum_{b=1}^B \hat{\theta}_n^{(b)}\).

  • Variance: \(\;\text{Var}_F(\hat{\theta}_n) \;\approx\; \frac{1}{B-1}\sum_{b=1}^B \left(\hat{\theta}_n^{(b)} - \bar{\hat{\theta}}\right)^2\).

  • MSE: \(\;\mathbb{E}_F[(\hat{\theta}_n - \theta)^2] \;\approx\; \frac{1}{B}\sum_{b=1}^B \left(\hat{\theta}_n^{(b)} - \theta(F)\right)^2\).

All are MC approximations of \(\mathbb{E}_{G_n}[h(T_n)]\) for appropriate \(h\).
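These three summaries apply to any vector of MC draws; a small helper (hypothetical name `mc_summary`), sanity-checked on a case with known answers (\(\bar{X}\) from \(n = 25\) draws of \(N(2, 1)\): bias \(0\), variance \(1/25\)):

```r
mc_summary <- function(draws, theta) {
  c(bias = mean(draws) - theta,
    var  = var(draws),
    MSE  = mean((draws - theta)^2))
}

set.seed(1)
draws <- replicate(20000, mean(rnorm(25, mean = 2)))
round(mc_summary(draws, theta = 2), 4)
```

Up to a \((B-1)/B\) factor in the variance, the three satisfy MSE \(=\) bias\(^2\) \(+\) variance.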

Note: We Know \(\theta(F)\)

  • A crucial feature of MC simulation: we chose \(F\), so we know \(\theta(F)\).

  • With real data, we don’t know \(\theta\) — that’s why we need an estimator.

  • In MC simulation, we know \(\theta\) — so we can directly evaluate how well the estimator performs.

Examples

We will work through five examples of increasing complexity:

  1. Estimating \(p\)

    • bias-variance tradeoff between MLE and Laplace smoothing.
  2. Estimating \(1/p\)

    • small-denominator problems and the delta method.
  3. Estimating \(1/\mu\) with normal data

    • nonexistent moments, connecting to diagnostics from Part 1.

Examples (cont’d)

  4. Heteroskedastic standard errors
    • setting where the estimator is well behaved but SEs are problematic.
  5. Post-model-selection inference
    • a multi-step algorithm where standard inference breaks down.

Examples (cont’d)

  • Example 1: closed-form finite-sample results are available and informative — we can compare MC simulation to known analytical results.

  • Examples 2–4: closed-form finite-sample results are unavailable or uninformative, motivating MC analysis.

  • Example 5: closed-form results exist only under normality with fixed covariates, motivating MC analysis to explore robustness to non-normality.

  • In each case, theory guides us on what to look for and which parameters matter.

Example 1: Estimating a Probability

Suppose \(W_i \stackrel{i.i.d.}{\sim} \text{Bernoulli}(p)\), and we want to estimate \(p\).

Two estimators:

  • MLE / sample mean: \(\hat{p} = \frac{1}{n}\sum_{i=1}^n W_i\).

  • Laplace-smoothed estimator: \(\tilde{p} = \frac{\sum_{i=1}^n W_i + 1}{n + 2}\).

Laplace-smoothed estimator

  • Laplace-smoothed estimator: \(\tilde{p} = \frac{\sum_{i=1}^n W_i + 1}{n + 2}\).

    • \(\tilde{p}\) shrinks toward \(1/2\) — adds one “pseudo-success” and one “pseudo-failure.”

      • Shrinkage is of order \(1/n\).
    • Equivalently, \(\tilde{p}\) is the posterior mean under a \(\text{Beta}(1,1)\) (i.e., uniform) prior.

    • We will return to shrinkage estimators later in the course (ridge regression, lasso).

Analytical Properties

Since \(\sum W_i \sim \text{Binomial}(n, p)\):

|          | \(\hat{p}\) (MLE)       | \(\tilde{p}\) (Laplace)           |
|----------|-------------------------|-----------------------------------|
| Bias     | \(0\)                   | \(\frac{1 - 2p}{n+2}\)            |
| Variance | \(\frac{p(1-p)}{n}\)    | \(\frac{n \, p(1-p)}{(n+2)^2}\)   |

  • \(\tilde{p}\) is biased toward \(1/2\): the bias is positive when \(p < 1/2\), negative when \(p > 1/2\), and zero at \(p = 1/2\).

  • \(\tilde{p}\) has strictly smaller variance than \(\hat{p}\) since \(\frac{n}{(n+2)^2} < \frac{1}{n}\).

  • MSE comparison depends on \(p\) and \(n\).
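These exact formulas are easy to code; a sketch (hypothetical helper `exact_props`) that reproduces the values appearing in the validation table below:

```r
exact_props <- function(n, p) {
  v_mle <- p * (1 - p) / n              # exact variance of the MLE
  b_lap <- (1 - 2 * p) / (n + 2)        # exact bias of the Laplace estimator
  v_lap <- n * p * (1 - p) / (n + 2)^2  # exact variance of the Laplace estimator
  c(bias_mle = 0,     var_mle = v_mle, MSE_mle = v_mle,
    bias_lap = b_lap, var_lap = v_lap, MSE_lap = b_lap^2 + v_lap)
}
round(exact_props(n = 50, p = 0.05), 5)
```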

What Theory Tells Us

  • The bias of \(\tilde{p}\) is \((1 - 2p)/(n+2)\) — largest when \(p\) is far from \(1/2\), and shrinks with \(n\).

  • The ratio of variances is \(\frac{n\,p(1-p)/(n+2)^2}{p(1-p)/n} = \frac{n^2}{(n+2)^2}\) — nearly 1 for large \(n\).

Theory guides the DGP choice: we pick \(n = 50\) and \(p = 0.05\) (far from \(1/2\), moderate \(n\)) to make the bias-variance tradeoff visible.

Estimating \(p\): Bias and MSE

set.seed(5551)
B <- 100000; n <- 50; p <- 0.05

sim_p <- function(n, p) {
  W <- rbinom(1, n, p)  # sum of n Bernoulli draws
  p_hat   <- W / n
  p_tilde <- (W + 1) / (n + 2)
  c(p_hat = p_hat, p_tilde = p_tilde)
}

results <- replicate(B, sim_p(n, p))
p_hat_draws   <- results["p_hat", ]
p_tilde_draws <- results["p_tilde", ]
# MLE
c(bias = mean(p_hat_draws) - p,
  var  = var(p_hat_draws),
  MSE  = mean((p_hat_draws - p)^2))
      bias        var        MSE 
-0.0000054  0.0009500  0.0009500 
# Laplace
c(bias = mean(p_tilde_draws) - p,
  var  = var(p_tilde_draws),
  MSE  = mean((p_tilde_draws - p)^2))
   bias     var     MSE 
0.01700 0.00088 0.00120 

Estimating \(p\): Sampling Distributions

  • \(\hat{p}\) is unbiased but has substantial mass at zero. \(\tilde{p}\) is biased toward \(1/2\) but has lower variance and no mass at zero.

MC vs. Exact: Validation

| Quantity | Estimator | Exact   | MC       |
|----------|-----------|---------|----------|
| Bias     | MLE       | 0.00000 | -0.00001 |
| Variance | MLE       | 0.00095 | 0.00095  |
| MSE      | MLE       | 0.00095 | 0.00095  |
| Bias     | Laplace   | 0.01731 | 0.01730  |
| Variance | Laplace   | 0.00088 | 0.00088  |
| MSE      | Laplace   | 0.00118 | 0.00118  |

  • MC estimates closely match the exact values — the small discrepancies are MC noise of order \(1/\sqrt{B}\).

  • Good practice: when exact results are available, use them to validate your simulation code before moving to settings where MC is the only option.

Bias–Variance Tradeoff Across \(p\)

  • Sample mean (MLE) has lower MSE when \(p\) is near 0 or 1, where the bias of \(\tilde{p}\) is largest; Laplace has lower MSE near \(p = 1/2\).

Example 1: Fuller Analysis

A more complete MC study would additionally vary \(n\):

  • For large \(n\), the shrinkage of order \(1/n\) becomes negligible — the two estimators converge.

  • Laplace differs more from MLE when \(p\) and \(n\) are small.

Theory tells us the key quantity is \(np\) (expected number of successes): when \(np\) is small, the discrete nature of \(\hat{p}\) dominates; when \(np\) is large, the normal approximation becomes more reliable and bias-variance tradeoffs are modest.
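A sketch of one slice of that fuller study, using the exact formulas (no simulation needed) to trace the Laplace/MLE MSE ratio as \(n\) grows at \(p = 0.05\):

```r
# Exact MSE ratio Laplace/MLE across n, at p = 0.05
mse_ratio <- function(n, p) {
  mse_mle <- p * (1 - p) / n
  mse_lap <- ((1 - 2 * p) / (n + 2))^2 + n * p * (1 - p) / (n + 2)^2
  mse_lap / mse_mle
}
r <- sapply(c(10, 50, 200, 1000), mse_ratio, p = 0.05)
round(r, 3)   # ratio shrinks toward 1 as n grows
```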

Example 2: Estimating \(1/p\)

Now consider \(\theta(p) = 1/p\). Arises in, e.g., estimating expected number of trials until first success.

  • Plug-in MLE: \(1/\hat{p}\).

  • Laplace plug-in: \(1/\tilde{p} = (n+2)/(\sum W_i + 1)\).

Example 2: Estimating \(1/p\)

Now consider \(\theta(p) = 1/p\). Arises in, e.g., estimating expected number of trials until first success.

  • Plug-in MLE: \(1/\hat{p}\).

    • Undefined when \(\hat{p} = 0\), which occurs with probability \((1-p)^n\).
  • Laplace plug-in: \(1/\tilde{p} = (n+2)/(\sum W_i + 1)\).

    • Always well-defined and bounded.
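The probability of the failure event is available in closed form; at the design used in the simulations below (\(n = 50\), \(p = 0.05\)):

```r
n <- 50; p <- 0.05
(1 - p)^n          # Pr(p_hat = 0) = Pr(no successes), approx 0.0769
dbinom(0, n, p)    # same probability via the binomial pmf
```

This matches the MC estimate of about 0.077 reported below, up to MC noise.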

Example 2: Estimating \(1/p\)

  • Plug-in MLE: \(1/\hat{p}\).

  • Laplace plug-in: \(1/\tilde{p} = (n+2)/(\sum W_i + 1)\).

  • Exact expressions for bias, variance, and the distribution of \(1/\hat{p}\) and \(1/\tilde{p}\) can be written as finite sums over the binomial distribution, but they do not simplify to informative closed-form formulas.

Example 2: Estimating \(1/p\) (cont’d)

  • Plug-in MLE: \(1/\hat{p}\).

    • Delta method approximation for \(1/\hat{p}\): \[\text{Var}\left( \frac{1}{\hat{p}} \right) \approx \frac{1-p}{n p^3}.\]

    • But delta method relies on \(g(x) = 1/x\) being approx. linear over the region where \(\hat{p}\) lies with high probability.

    • Since \(g''(x) = 2/x^3\), the nonlinearity is severe near \(x = 0\). The approximation is poor when \(p\) is small and \(\hat{p}\) has substantial mass near zero.
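Plugging the simulation design (\(n = 50\), \(p = 0.05\)) into the delta-method variance formula gives the standard deviation quoted later:

```r
n <- 50; p <- 0.05
delta_var <- (1 - p) / (n * p^3)  # delta method: Var(1/p_hat) approx (1-p)/(n p^3)
delta_sd  <- sqrt(delta_var)
delta_sd                          # approx 12.33
```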

Delta Method: Visualization

The delta method approximates \(g(x) = 1/x\) by its tangent line at \(x = p\). This works when \(\hat{p}\) concentrates in a region where the tangent line is close to \(g\).


Estimating \(1/p\): Simulation

set.seed(5551)
B <- 10000; n <- 50; p <- 0.05
theta <- 1/p  # 1/p = 20

sim_inv_p <- function(n, p) {
  W <- rbinom(1, n, p)
  p_hat   <- W / n
  p_tilde <- (W + 1) / (n + 2)
  # 1/p_hat is Inf when W = 0; exclude those
  inv_hat   <- ifelse(W > 0, 1 / p_hat, NA)
  inv_tilde <- 1 / p_tilde
  c(inv_hat = inv_hat, inv_tilde = inv_tilde, W = W)
}

results <- replicate(B, sim_inv_p(n, p))
inv_hat_draws   <- results["inv_hat", ]
inv_tilde_draws <- results["inv_tilde", ]
W_draws         <- results["W", ]

Estimating \(1/p\): Simulation

sim_inv_p <- function(n, p) {
  W <- rbinom(1, n, p)
  p_hat   <- W / n
  p_tilde <- (W + 1) / (n + 2)
  # 1/p_hat is Inf when W = 0; exclude those
  inv_hat   <- ifelse(W > 0, 1 / p_hat, NA)
  inv_tilde <- 1 / p_tilde
  c(inv_hat = inv_hat, inv_tilde = inv_tilde, W = W)
}

results <- replicate(B, sim_inv_p(n, p))
inv_hat_draws   <- results["inv_hat", ]
inv_tilde_draws <- results["inv_tilde", ]
W_draws         <- results["W", ]

# Fraction of replications where p_hat = 0
frac_zero <- mean(W_draws == 0)
frac_zero
Pr(p_hat = 0) = 0.0765 

Estimating \(1/p\): Simulation

# MLE (excluding p_hat = 0 cases)
c(bias = mean(inv_hat_draws, na.rm = TRUE) - theta,
  var  = var(inv_hat_draws, na.rm = TRUE),
  MSE  = mean((inv_hat_draws - theta)^2, na.rm = TRUE))
  bias    var    MSE 
  4.96 206.00 231.00 
  • \(1/\hat{p}\) is highly variable and upward-biased (even after excluding the \(\hat{p}=0\) cases).
# Laplace
c(bias = mean(inv_tilde_draws) - theta,
  var  = var(inv_tilde_draws),
  MSE  = mean((inv_tilde_draws - theta)^2))
  bias    var    MSE 
 -1.09 124.00 126.00 
  • Smoothing stabilizes the denominator, reducing MSE.

Delta Method vs. MC Distributions

Delta Method vs. MC Distributions

The delta method approximation for \(1/\hat{p}\): \[\frac{1}{\hat{p}} \;\overset{\cdot}{\sim}\; N\!\left(\frac{1}{p},\;\; \frac{1-p}{n\,p^3}\right).\] The delta-method normal density (red curve on the previous slide):

  • is centered too low (misses the upward bias of \(1/\hat{p}\));
  • is too narrow (understates dispersion);
  • misses the right-skew of the sampling distribution.

Variability of \(1/\hat{p}\): Delta Method vs. MC

For \(p=0.05\), \(n=50\),

| Method                            | Std. Dev. |
|-----------------------------------|-----------|
| Delta method                      | 12.33     |
| MC (excluding \(\hat{p} = 0\))    | 14.37     |

  • Delta method understates the variability of \(1/\hat{p}\) because it relies on approximate linearity of \(1/x\) over the region where \(\hat{p}\) lies with high probability.
  • When \(p\) is small, \(\hat{p}\) lies with high prob. in a region where \(1/x\) is highly nonlinear — delta method approx. is poor at fixed \(n\), though it improves as \(n \to \infty\) (as \(\hat{p}\) concentrates around \(p\)).

Delta Method: Inference

The delta method suggests the 95% CI: \(\;\;1/\hat{p} \;\pm\; z_{\alpha/2}\,\sqrt{(1-\hat{p})\,/\,(n\,\hat{p}^3)}\).

How does coverage depend on \(p\) (with \(n = 50\))?

Simulation code
set.seed(5551)
B <- 10000; n <- 50

sim_delta_cov <- function(n, p) {
  theta <- 1/p
  W <- rbinom(1, n, p)
  p_hat <- W / n
  if (W == 0) return(NA)
  delta_se <- sqrt((1 - p_hat) / (n * p_hat^3))
  ci <- 1/p_hat + c(-1, 1) * qnorm(0.975) * delta_se
  ci[1] <= theta & theta <= ci[2]
}

p_vals <- c(0.01, 0.05, 0.25)

df_cov <- data.frame(
  p = paste0("$", p_vals, "$"),
  theta = paste0("$", 1/p_vals, "$"),
  coverage = sapply(p_vals, function(pp)
    mean(replicate(B, sim_delta_cov(n, pp)), na.rm = TRUE))
)

knitr::kable(df_cov, digits = 3, align = "ccc",
             col.names = c("$p$", "$1/p$", "Coverage (nominal 95%)"))
| \(p\)      | \(1/p\)   | Coverage (nominal 95%) |
|------------|-----------|------------------------|
| \(0.01\)   | \(100\)   | 0.776                  |
| \(0.05\)   | \(20\)    | 0.891                  |
| \(0.25\)   | \(4\)     | 0.943                  |


Takeaway: Small Denominators

  • When an estimator involves a ratio, and the denominator can be small, be cautious about:
    • The delta method for variance and distributional approximations.
    • Plug-in estimators that can explode.
  • Regularization (here, Laplace smoothing) stabilizes the denominator — a theme we will return to with ridge regression and lasso.

Example 2: Fuller Analysis

Theory identifies the key quantity: the delta method approximation for \(g(\hat{p})\) depends on \(g\) being approximately linear over the region where \(\hat{p}\) lies with high probability.

  • For \(g(x) = 1/x\) with \(g''(x) = 2/x^3\), the curvature is large near zero.

  • So the delta method should improve as \(p\) moves away from zero (less mass near zero) and as \(n\) increases (the distribution of \(\hat{p}\) concentrates).

A fuller MC study would vary \(n\) and \(p\) systematically.

Example 3: Nonexistent Moments

Recall from last lecture: MC approximation of \(\mathbb{E}[g(X)]\) fails when \(\mathbb{E}[|g(X)|] = \infty\).

Can this arise with familiar estimators?

Consider \(X_i \stackrel{i.i.d.}{\sim} N(\mu, \sigma^2)\) and the estimand \(\theta = 1/\mu\).

  • Plug-in estimator: \(T_n = 1/\bar{X}\).

  • \(\bar{X} \sim N(\mu, \sigma^2/n)\) \(\Rightarrow\) the density of \(\bar{X}\) is bounded away from zero in a neighborhood of the origin \(\Rightarrow \mathbb{E}[|1/\bar{X}|] = \infty\).

Nonexistent Moments: Not Just Division by Zero

The issue is not that \(\bar{X}\) can equal zero (it does so with probability zero for continuous \(\bar{X}\)).

The issue is that \(\bar{X}\) has density bounded away from zero in a neighborhood of \(0\) \(\Rightarrow\) \(\int |1/x| \, \phi(x;\mu,\sigma^2/n) \, dx = \infty\).

  • This is true even when \(\mu\) is far from zero, but the practical impact depends on \(\mu / (\sigma/\sqrt{n})\): when this ratio is large, the density of \(\bar{X}\) puts so little probability mass in a neighborhood of zero that the nonexistent moment has no practical consequence.

Nonexistent Moments: Simulation

DGP: \(X_i \stackrel{i.i.d.}{\sim} N(\mu, 1)\), \(n = 20\), estimand \(\theta = 1/\mu\).

set.seed(5551)
B <- 10000; n <- 20; sigma <- 1

sim_inv_mean <- function(n, mu, sigma) {
  xbar <- mean(rnorm(n, mu, sigma))
  1 / xbar
}

# mu = 2: ratio mu/(sigma/sqrt(n)) = 2/(1/sqrt(20)) ≈ 8.9
inv_mean_far <- replicate(B, sim_inv_mean(n, mu = 2, sigma))

# mu = 0.2: ratio ≈ 0.89
inv_mean_near <- replicate(B, sim_inv_mean(n, mu = 0.2, sigma))

\(\mu = 2\): No Diagnostic Red Flags

  • With \(\mu=2\), the running mean looks stable — \(\mathbb{E}[1/\bar{X}]\) doesn’t exist, but the diagnostic red flags are absent because \(\bar{X}\) so rarely comes close to zero.

\(\mu = 0.2\): Diagnostic Red Flag

  • With \(\mu=0.2\), the running mean shows erratic jumps — the diagnostics from last lecture flag this.

How Stability of \(1/\bar{X}_n\) Depends on \(\mu\)

Since \(\bar{X}_n \sim N(\mu, \sigma^2/n)\), we can compute the probability that \(\bar{X}_n\) passes near zero, where \(1/\bar{X}_n\) explodes. With \(n=20\), \(\sigma = 1\):

|                 | \(\mu\,/\,(\sigma/\sqrt{n})\) | \(\Pr(\vert\bar{X}_n\vert < 0.1)\) |
|-----------------|-------------------------------|------------------------------------|
| \(\mu = 2\)     | 8.9                           | 0.000000                           |
| \(\mu = 0.2\)   | 0.9                           | 0.237504                           |

With \(\mu = 2\), one can show \(\Pr( | \bar{X}_n | < 0.1) < 10^{-17}\) via the Mills ratio bound; floating-point arithmetic cannot distinguish this from zero.
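The probabilities in the table follow directly from \(\bar{X}_n \sim N(\mu, \sigma^2/n)\); a sketch:

```r
n <- 20; sigma <- 1
p_near_zero <- function(mu) {
  s <- sigma / sqrt(n)                   # SD of the sample mean
  pnorm(0.1, mu, s) - pnorm(-0.1, mu, s) # Pr(|X_bar| < 0.1)
}
p_near_zero(2)     # numerically indistinguishable from 0
p_near_zero(0.2)   # approx 0.2375
```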

How Stability of \(1/\bar{X}_n\) Depends on \(\mu\) (cont’d)

Since \(\bar{X}_n \sim N(\mu, \sigma^2/n)\), we can compute the probability that \(\bar{X}_n\) passes near zero, where \(1/\bar{X}_n\) explodes. With \(n=20\), \(\sigma = 1\):

|                 | \(\mu\,/\,(\sigma/\sqrt{n})\) | \(\Pr(\vert\bar{X}_n\vert < 0.1)\) |
|-----------------|-------------------------------|------------------------------------|
| \(\mu = 2\)     | 8.9                           | 0.000000                           |
| \(\mu = 0.2\)   | 0.9                           | 0.237504                           |

  • When \(\mu/(\sigma/\sqrt{n})\) is large, \(\bar{X}_n\) approaches zero so rarely that the nonexistent moment has no practical consequence.

  • When \(\mu/(\sigma/\sqrt{n})\) is small, \(\bar{X}_n\) passes near zero often enough to destabilize \(1/\bar{X}_n\).

Inference Consequences with \(n=20\)

The delta method CI: \(\;\;1/\bar{X}_n \;\pm\; z_{\alpha/2}\,\cdot\, s\,/\,(\sqrt{n}\,\bar{X}_n^2)\).

|                 | \(\mu\,/\,(\sigma/\sqrt{n})\) | Coverage (nominal 95%) |
|-----------------|-------------------------------|------------------------|
| \(\mu = 2\)     | 8.9                           | 0.937                  |
| \(\mu = 0.2\)   | 0.9                           | 0.792                  |

  • When \(\mu/(\sigma/\sqrt{n})\) is large, the delta method CI performs adequately despite the nonexistent moment.

  • When \(\mu/(\sigma/\sqrt{n})\) is small, coverage degrades — the poor normal approximation to the sampling distribution translates directly into unreliable inference.
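A sketch of the simulation behind this coverage table (same structure as the earlier coverage code; \(B\) reduced for speed, so the figures differ slightly from the table):

```r
set.seed(5551)
B <- 5000; n <- 20; sigma <- 1

covers <- function(mu) {
  x <- rnorm(n, mu, sigma)
  xbar <- mean(x); s <- sd(x)
  se <- s / (sqrt(n) * xbar^2)            # delta-method SE of 1/xbar
  ci <- 1/xbar + c(-1, 1) * qnorm(0.975) * se
  ci[1] <= 1/mu & 1/mu <= ci[2]
}

cov_far  <- mean(replicate(B, covers(mu = 2)))    # near nominal
cov_near <- mean(replicate(B, covers(mu = 0.2)))  # well below nominal
c(cov_far = cov_far, cov_near = cov_near)
```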

\(1/\bar{X}\): Delta Method vs. MC Distribution

The delta method predicts \(\sqrt{n}\,(1/\bar{X} - 1/\mu)\,/\sqrt{\sigma^2/\mu^4} \;\overset{\cdot}{\sim}\; N(0,1).\)

| Quantile | \(N(0,1)\) | MC, \(\mu = 2\) | MC, \(\mu = 0.2\) |
|----------|------------|-----------------|-------------------|
| 2.5%     | -1.96      | -1.59           | -9.24             |
| 5%       | -1.64      | -1.37           | -4.97             |
| 10%      | -1.28      | -1.11           | -2.71             |
| 25%      | -0.67      | -0.60           | -0.57             |
| 50%      | 0.00       | 0.01            | -0.32             |
| 75%      | 0.67       | 0.74            | 0.20              |
| 90%      | 1.28       | 1.47            | 1.55              |
| 95%      | 1.64       | 2.00            | 3.59              |
| 97.5%    | 1.96       | 2.53            | 7.09              |

\(1/\bar{X}\): Delta Method vs. MC Distribution

  • \(\mu = 2\) (\(\mu/\text{SE} = 8.9\)): MC quantiles closely match \(N(0,1)\) — the delta method approximation is accurate.
  • \(\mu = 0.2\) (\(\mu/\text{SE} = 0.89\)): the distribution is heavily left-skewed with extreme left-tail quantiles — the normal approximation fails qualitatively, not just in scale.

Takeaway: Nonexistent Moments

  • \(\mathbb{E}[1/\bar{X}]\) does not exist for any \(\mu\) when \(X_i \sim N(\mu,\sigma^2)\).

  • When \(\mu/(\sigma/\sqrt{n})\) is large, the nonexistence is a mathematical fact with no practical consequences for MC.

  • When \(\mu/(\sigma/\sqrt{n})\) is small, the nonexistence manifests in unstable MC estimates.

Example 3 (cont’d)

  • Theory identifies \(\mu/(\sigma/\sqrt{n})\) as the key quantity governing the practical severity of the nonexistent moment.
    • For a more complete analysis, would examine how coverage of the delta method CI degrades as \(\mu/(\sigma/\sqrt{n}) \to 0\), generating a coverage curve indexed by this ratio.

Example 3 (cont’d)

  • Theory identifies \(\mu/(\sigma/\sqrt{n})\) as the key quantity governing the practical severity of the nonexistent moment.

  • The same structure arises whenever an estimator involves a denominator that can pass near zero. When the denominator is approximately normally distributed and centered at its true value, the ratio of the true value to its SE is the natural index — it measures how many standard deviations separate the denominator from zero.

Connection to IV Estimation

  • The same structure arises with IV estimation: the exactly identified IV estimator can have no finite moments regardless of instrument strength, though instrument strength affects how practically severe the tail problem is at a given \(n\).
  • Separately, when instruments are weak, the asymptotic normal approximation to the sampling distribution of the IV estimator is poor — a failure of \(G_n(\cdot, F) \approx N(\cdot)\), not just a moment problem.

Example 4: Inference with Unreliable Standard Errors

Consider the linear regression \(Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i\) with heteroskedastic errors: \(\text{Var}(\varepsilon_i \mid X_i) = \sigma^2 X_i^2.\)

  • OLS \(\hat{\beta}_1\) is still consistent and asymptotically normal, but inference depends on the standard error:

    • Homoskedastic SE — valid only under \(\text{Var}(\varepsilon_i \mid X_i) = \sigma^2\); inconsistent here.
    • HC1 (White) robust SE — consistent under heteroskedasticity.
    • HC3 SE — finite-sample correction.

Example 4: Theory Guides the DGP

With \(\text{Var}(\varepsilon_i \mid X_i) = \sigma^2 X_i^2\), the degree of heteroskedasticity depends on the distribution of \(X_i\).

  • If \(X_i\) has a wide range, \(X_i^2\) varies a lot \(\Rightarrow\) severe heteroskedasticity.

  • If \(X_i\) is nearly constant, \(X_i^2 \approx c\) \(\Rightarrow\) approximately homoskedastic.

DGP for Simulation

\[\begin{align*} Y_i & = \beta_0 + \beta_1 X_i + \varepsilon_i\\ X_i & = |Z_i| + 0.5\\ \varepsilon_i & = \sigma X_i \cdot \eta_i \end{align*}\]

with

  • \(Z_i, \eta_i \sim N(0,1),\) independent of each other.

  • \(\beta_0 = 1,~~ \beta_1 = 2, ~~\sigma = 1\).

SE Comparison: Setup

set.seed(5551)
B <- 10000; alpha <- 0.05
beta_0 <- 1; beta_1 <- 2; sigma <- 1
library(sandwich)
library(lmtest)

sim_se_test <- function(n) {
  X   <- abs(rnorm(n)) + 0.5
  eps <- rnorm(n) * sigma * X
  Y   <- beta_0 + beta_1 * X + eps
  fit <- lm(Y ~ X)
  b1 <- unname(coef(fit)[2])
  se_hom <- coeftest(fit)[2, 2]
  se_hc0 <- coeftest(fit, vcov = vcovHC(fit, type = "HC0"))[2, 2]
  se_hc1 <- coeftest(fit, vcov = vcovHC(fit, type = "HC1"))[2, 2]
  se_hc3 <- coeftest(fit, vcov = vcovHC(fit, type = "HC3"))[2, 2]
  c(rej_hom = abs((b1 - beta_1) / se_hom) > qnorm(1 - alpha/2),
    rej_hc0 = abs((b1 - beta_1) / se_hc0) > qnorm(1 - alpha/2),
    rej_hc1 = abs((b1 - beta_1) / se_hc1) > qnorm(1 - alpha/2),
    rej_hc3 = abs((b1 - beta_1) / se_hc3) > qnorm(1 - alpha/2))
}

SE Comparison: Empirical Size (nominal 5%)

Simulation code
n_vals <- c(20, 50, 100, 500)
size_results <- sapply(n_vals, function(nn) {
  res <- replicate(B, sim_se_test(nn))
  rowMeans(res)
})
colnames(size_results) <- paste0("n=", n_vals)
| SE Type             | \(n = 20\) | \(n = 50\) | \(n = 100\) | \(n = 500\) |
|---------------------|------------|------------|-------------|-------------|
| Homoskedastic       | 0.192      | 0.192      | 0.190       | 0.192       |
| HC0 (White)         | 0.188      | 0.112      | 0.086       | 0.057       |
| HC1 (df-corrected)  | 0.167      | 0.107      | 0.083       | 0.057       |
| HC3                 | 0.100      | 0.078      | 0.069       | 0.054       |


Empirical Size: Reading the Table

  • Homoskedastic SE: over-rejection that does not vanish with \(n\) — consistent with theory (the SE converges to the wrong limit).

  • HC0 (White): consistent, but over-rejects at small \(n\) — the finite-sample bias of the sandwich estimator matters.

  • HC1: the \(n/(n-k)\) correction helps slightly.

  • HC3: best finite-sample performance — the leverage-based correction addresses the downward bias in HC0.
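HC0 and HC3 differ only in how squared residuals are weighted by leverage; a by-hand base-R sketch of the sandwich algebra (the slides use `sandwich::vcovHC`, which implements the same formulas):

```r
set.seed(1)
n <- 50
X <- cbind(1, abs(rnorm(n)) + 0.5)            # intercept and regressor
y <- drop(X %*% c(1, 2)) + rnorm(n) * X[, 2]  # heteroskedastic errors
e <- residuals(lm(y ~ X[, 2]))                # OLS residuals
XtXi <- solve(crossprod(X))                   # (X'X)^{-1}
h <- rowSums((X %*% XtXi) * X)                # leverages h_ii
meat <- function(w) crossprod(X * sqrt(w))    # sum_i w_i x_i x_i'
V_hc0 <- XtXi %*% meat(e^2) %*% XtXi               # HC0: raw e_i^2
V_hc3 <- XtXi %*% meat(e^2 / (1 - h)^2) %*% XtXi   # HC3: leverage-inflated
c(se_hc0 = sqrt(V_hc0[2, 2]), se_hc3 = sqrt(V_hc3[2, 2]))
```

Dividing by \((1 - h_i)^2\) inflates the contribution of high-leverage observations, offsetting the downward bias of HC0.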

Visualizing the Problem: \(t\)-Statistic Distributions

  • With the homoskedastic SE, the \(t\)-statistic is too dispersed — the SE underestimates the true variability of \(\hat{\beta}_1\). The \(N(0,1)\) reference distribution is too narrow.

Takeaway: Standard Errors Matter

  • The estimator \(\hat{\beta}_1\) can be perfectly fine (consistent, asymptotically normal), yet inference breaks down if the SE is unreliable.

  • An inconsistent SE (wrong formula) causes size distortions that do not vanish with \(n\).

  • A consistent but imprecise SE (HC1 at small \(n\)) causes size distortions that vanish as \(n \to \infty\), but may be substantial at given \(n\).

  • HC3 provides a finite-sample correction.

Example 4: Fuller Analysis

Theory tells us the severity of the problem depends on:

  • Degree of heteroskedasticity: how much \(\text{Var}(\varepsilon_i \mid X_i)\) varies across \(i\). In our DGP, this is controlled by the distribution of \(X_i\).

  • Leverage: observations with high leverage (extreme \(X_i\) values) have disproportionate influence on the sandwich estimator’s finite-sample bias.

  • Sample size: HC0 and HC1 are consistent, so their over-rejection vanishes — but the question is how large the distortions are at given \(n\).

Example 4: Fuller Analysis (cont’d)

A fuller MC study would vary:

  • the heteroskedasticity function

  • the \(X\) distribution

    • e.g., try different distributions with the same variance but different tail behavior to see how leverage affects the bias of the sandwich estimator
  • sample size \(n\).

Example 5: Post-Model-Selection Inference

Suppose \[Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \varepsilon_i, \qquad \varepsilon_i \stackrel{i.i.d.}{\sim} N(0, \sigma^2)\]

  • Parameter of interest: \(\beta_1\).

  • Two candidate models:

    • Long Regression: regress \(Y\) on \((1, X_1, X_2)\).
    • Short Regression: regress \(Y\) on \((1, X_1)\).

Example 5: DGP

\[Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \varepsilon_i, \qquad \varepsilon_i \stackrel{i.i.d.}{\sim} N(0, \sigma^2)\]

Parameters: \[\begin{align*} &\beta_0 = 1, \quad \beta_1 = 2, \quad \sigma = 1, \quad n = 100\\ &\text{Corr}(X_1, X_2) = 0.75 \end{align*}\]

Post-Selection Estimator

A post-model-selection estimator of \(\beta_1\):

  1. Fit the full model; test \(H_0: \beta_2 = 0\) at level \(\alpha_{\text{sel}} = 0.05\).
  2. If reject: keep the full model. If fail to reject: drop \(X_2\).
  3. Report \(\hat{\beta}_1\) from the selected model.
  • This is a natural workflow — researchers routinely test whether to include covariates.

  • But the resulting \(\hat{\beta}_1^{\text{post}}\) is a random mixture of two different estimators, and its distribution is generally not normal.

Example 5: Theoretical Background

Under the normal-error DGP with fixed \(X\): \[(\hat\beta_1, \hat\beta_2) \sim N\!\left(\beta,\; \sigma^2 (X'X)^{-1}\right)\]

  • In this two-regressor setting, \(\text{Corr}(X_1, X_2) > 0\) \(\Rightarrow\) the off-diagonal elements of \((X'X)^{-1}\) are negative \(\Rightarrow\) \(\text{Cov}(\hat\beta_1, \hat\beta_2) < 0\).

  • Selection event: \(\{|t_2| \leq c\}\) truncates the range of \(\hat\beta_2\). Since \(\text{Corr}(\hat\beta_1, \hat\beta_2)<0\), this truncation shifts the conditional distribution of \(\hat\beta_1\).

  • OVB when \(X_2\) is dropped: the short regression estimator satisfies \(\tilde\beta_1 \xrightarrow{p} \beta_1 + \gamma \beta_2\) where \(\gamma = \text{Cov}(X_1,X_2)/\text{Var}(X_1)\).
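The OVB formula can be verified numerically. In the slide's design both regressors have unit variance, so \(\gamma = \text{Cov}(X_1,X_2)/\text{Var}(X_1) = \rho\); a sketch with a large \(n\) to make the limit visible:

```r
set.seed(5551)
n <- 100000; rho <- 0.75; beta_1 <- 2; beta_2 <- 0.25
Z1 <- rnorm(n); Z2 <- rnorm(n)
X1 <- Z1
X2 <- rho * Z1 + sqrt(1 - rho^2) * Z2   # Corr(X1, X2) = rho
Y  <- 1 + beta_1 * X1 + beta_2 * X2 + rnorm(n)
b1_short <- coef(lm(Y ~ X1))[["X1"]]    # short regression drops X2
b1_short                                # approx beta_1 + rho * beta_2 = 2.1875
```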

Example 5: What Theory Predicts

The post-selection estimator \(\hat\beta_1^{\text{post}}\) is a mixture:

\[\hat\beta_1^{\text{post}} = \hat\beta_1^{\text{long}} \cdot \mathbf{1}[\text{reject}] + \tilde\beta_1^{\text{short}} \cdot \mathbf{1}[\text{fail to reject}]\]

Theory predicts:

  • \(\beta_2 = 0\): no problem.

  • \(\beta_2\) large: test almost always rejects \(\Rightarrow\) post-selection \(\approx\) full model \(\Rightarrow\) no problem.

Example 5: What Theory Predicts

The post-selection estimator \(\hat\beta_1^{\text{post}}\) is a mixture:

\[\hat\beta_1^{\text{post}} = \hat\beta_1^{\text{long}} \cdot \mathbf{1}[\text{reject}] + \tilde\beta_1^{\text{short}} \cdot \mathbf{1}[\text{fail to reject}]\]

Theory predicts:

  • \(\beta_2\) intermediate: worst case — the test frequently fails to reject, selecting the short model (which has OVB), and conditioning on the selection event biases both components.

Post-Selection: Setup

set.seed(5551)
B <- 100000; n <- 100; sigma <- 1
beta_0 <- 1; beta_1 <- 2
alpha_sel <- 0.05  # selection test level

sim_post_selection <- function(n, beta_2, rho) {
  Z1 <- rnorm(n); Z2 <- rnorm(n)
  X1 <- Z1
  X2 <- rho * Z1 + sqrt(1 - rho^2) * Z2
  eps <- rnorm(n, 0, sigma)
  Y <- beta_0 + beta_1 * X1 + beta_2 * X2 + eps
  fit_full <- lm(Y ~ X1 + X2)
  p_val_b2 <- unname(summary(fit_full)$coefficients[3, 4])
  if (p_val_b2 < alpha_sel) {
    fit_sel <- fit_full
  } else {
    fit_sel <- lm(Y ~ X1)
  }
  b1_post <- unname(coef(fit_sel)["X1"])
  se_post <- unname(summary(fit_sel)$coefficients["X1", 2])
  ci_post <- b1_post + c(-1, 1) * qnorm(0.975) * se_post
  b1_full <- unname(coef(fit_full)["X1"])
  se_full <- unname(summary(fit_full)$coefficients["X1", 2])
  ci_full <- b1_full + c(-1, 1) * qnorm(0.975) * se_full
  c(b1_post = b1_post, se_post = se_post,
    covers_post = ci_post[1] <= beta_1 & beta_1 <= ci_post[2],
    b1_full = b1_full, se_full = se_full,
    covers_full = ci_full[1] <= beta_1 & beta_1 <= ci_full[2],
    selected_full = as.numeric(p_val_b2 < alpha_sel))
}

Case 1: \(\beta_2 = 0\), \(\rho = 0.75\)

res_b2_0 <- replicate(B, sim_post_selection(n, beta_2 = 0, rho = 0.75))
DGP: beta_0 = 1, beta_1 = 2, beta_2 = 0, sigma = 1, n = 100, rho = 0.75
Fraction selecting full model: 0.0494 
--- Post-selection estimator ---
Bias:     -0.000951 
Coverage: 0.93 
--- Always-full estimator ---
Bias:     -0.00136 
Coverage: 0.948 
  • When \(\beta_2 = 0\), the post-selection estimator is fine: both the long and short regression are unbiased for \(\beta_1\) when \(\beta_2 = 0\), and the symmetry of the selection event around \(\hat{\beta}_2 = 0\) preserves unbiasedness of the mixture.

Case 2: \(\beta_2 = 0.25\), \(\rho = 0.75\)

res_b2_025 <- replicate(B, sim_post_selection(n, beta_2 = 0.25, rho = 0.75))
DGP: beta_0 = 1, beta_1 = 2, beta_2 = 0.25, sigma = 1, n = 100, rho = 0.75
Fraction selecting full model: 0.365 
--- Post-selection estimator ---
Bias:     0.0768 
Coverage: 0.684 
--- Always-full estimator ---
Bias:     0.000353 
Coverage: 0.947 
  • Now the post-selection estimator is biased — from both OVB in the short model and conditioning on the selection event — and has poor coverage.

Post-Selection: Sampling Distribution

  • The post-selection distribution is a mixture: shifted left when \(X_2\) is dropped (OVB), centered when kept. This is the theory made visible.
  • The always-full estimator has a clean normal distribution.

Post-Selection: Varying \(\beta_2\)

  • Coverage of the post-selection CI is worst for intermediate \(\beta_2\) — exactly as theory predicted: large enough to cause OVB when dropped, but small enough that the test often fails to detect it.
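A condensed, self-contained version of that sweep (smaller \(B\) than behind the slide's figure, with a stripped-down simulator inlined so the chunk runs on its own) might look like:

```r
# Self-contained sketch of the beta_2 sweep: coverage of the nominal 95%
# post-selection CI for beta_1 over a grid of beta_2 values.
set.seed(1)
n <- 100; B <- 500; rho <- 0.75
beta_0 <- 1; beta_1 <- 2; sigma <- 1; alpha_sel <- 0.05

cover_one <- function(beta_2) {
  Z1 <- rnorm(n); Z2 <- rnorm(n)
  X1 <- Z1; X2 <- rho * Z1 + sqrt(1 - rho^2) * Z2
  Y <- beta_0 + beta_1 * X1 + beta_2 * X2 + rnorm(n, 0, sigma)
  fit_full <- lm(Y ~ X1 + X2)
  # Keep X2 only if its t-test rejects at level alpha_sel
  keep <- summary(fit_full)$coefficients[3, 4] < alpha_sel
  fit <- if (keep) fit_full else lm(Y ~ X1)
  b  <- coef(fit)["X1"]
  se <- summary(fit)$coefficients["X1", 2]
  (b - qnorm(0.975) * se <= beta_1) & (beta_1 <= b + qnorm(0.975) * se)
}

beta_2_grid <- c(0, 0.1, 0.25, 0.5)
coverage <- sapply(beta_2_grid, function(b2) mean(replicate(B, cover_one(b2))))
# Coverage is near 0.95 at beta_2 = 0, dips for intermediate beta_2,
# and recovers as the selection test gains power.
```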

Post-Selection: Why Does This Happen?

The validity of the standard CI rests on two premises:

  • The model was known in advance (not selected from the data).
  • The estimator has a single normal sampling distribution.

Post-Selection: Why Does This Happen? (cont’d)

But the post-selection estimator:

  • Is a mixture of two estimators (full and restricted), with mixture weights that depend on the data.
  • Has a sampling distribution that is generally not normal. Under local alternatives \(\beta_2 = c/\sqrt{n}\), the mixture persists asymptotically.
  • Uses a standard SE that does not account for the model selection step.

Example 5: Fuller Analysis

Theory identifies several quantities that govern the severity of the problem:

  • Correlation \(\rho\): higher \(|\rho|\) \(\Rightarrow\) larger OVB \(= \gamma \beta_2\) when \(X_2\) is dropped, and stronger dependence between \(\hat\beta_1\) and the selection event.

  • Signal-to-noise ratio \(\beta_2/\sigma\): determines the power of the selection test, and hence the mixing proportions.

  • Sample size \(n\): more power to detect \(\beta_2 \ne 0\), but the problem doesn’t vanish — the dip shifts toward smaller \(\beta_2\) as \(n\) grows.

A fuller MC study would vary \(\rho\), \(n\), and the number of candidate regressors (the problem compounds with more selection steps).

Summary of Examples

Example | Key Lesson
\(\hat{p}\) vs. Laplace | Bias–variance tradeoff; shrinkage can reduce MSE
\(1/\hat{p}\) vs. \(1/\tilde{p}\) | Small denominators degrade the delta-method approximation; regularization helps
\(1/\bar{X}\) | Nonexistent moments; diagnostics from Part 1

Summary of Examples (cont’d)

Example | Key Lesson
Heteroskedastic SEs | Wrong SE \(\Rightarrow\) invalid inference, even with a good estimator
Post-model-selection | Selection step \(\Rightarrow\) mixture distribution, invalid CIs
  • In every case, theory identified the problem and MC simulation revealed its magnitude at specific \((n, F)\).

Connection to \(G_n(\cdot, F)\)

All the quantities we’ve computed are features of \(G_n(\cdot, F)\):

  • Bias: \(\int u \, dG_n(u, F) - \theta(F)\).

  • MSE: \(\int (u - \theta)^2 \, dG_n(u, F)\).

  • Size: \(G_n(-z_{\alpha/2},\, F_0) + (1 - G_n(z_{\alpha/2},\, F_0))\),
    for the \(t\)-statistic under \(H_0\).

  • Power: same expression, evaluated at \(F_1 \ne F_0\).

  • Coverage: the probability that the interval contains \(\theta(F)\), computed from \(G_n\) evaluated at the CI endpoints.

MC simulation approximates \(G_n(\cdot, F)\) by its empirical analog \(\hat{G}_{n,B}(\cdot, F)\).
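As a concrete instance of this approximation, here is a minimal sketch building \(\hat{G}_{n,B}(\cdot, F)\) for \(T_n = \sqrt{n}(\bar{X} - \mu)\) under an exponential \(F\) (the choice of \(F\), \(n\), and \(B\) here is purely illustrative):

```r
# Empirical analog of G_n(., F) for T_n = sqrt(n) * (Xbar - mu),
# with F = Exponential(mean mu).
set.seed(1)
n <- 30; B <- 5000; mu <- 2
T_b <- replicate(B, sqrt(n) * (mean(rexp(n, rate = 1 / mu)) - mu))
G_hat <- ecdf(T_b)   # \hat G_{n,B}(., F): step function through the B draws
G_hat(0)             # approximates P(T_n <= 0) = G_n(0, F)
```

Any feature of \(G_n\) — a bias, an MSE, a rejection probability — is then a plug-in functional of `G_hat` or of the draws `T_b` directly.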

The MC Recipe: Summary

Looking Ahead: The Bootstrap

  • In MC simulation, we know \(F\) and approximate \(G_n(\cdot, F)\) by simulation.

  • In practice, \(F\) is unknown — we observe only one sample \(Y_1,\ldots,Y_n\).

  • The bootstrap idea: replace the unknown \(F\) with an estimate \(\hat{F}_n\) (e.g., the empirical distribution), then approximate \(G_n(\cdot, \hat{F}_n)\) by simulation.

  • The MC recipe is exactly the same — only the input changes from \(F\) to \(\hat{F}_n\).
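A minimal sketch of that change of input, estimating the SE of the sample mean by resampling from \(\hat{F}_n\) (the "observed" sample here is itself simulated, purely for illustration):

```r
# Bootstrap: same recipe as MC simulation, but draws come from the
# empirical distribution \hat F_n, i.e. resampling the data with replacement.
set.seed(1)
y <- rexp(50)          # stands in for the one observed sample
B_boot <- 2000
T_star <- replicate(B_boot, mean(sample(y, replace = TRUE)))
se_boot <- sd(T_star)  # bootstrap SE of the sample mean
```

Here `se_boot` should land close to the analytical `sd(y) / sqrt(50)`; the payoff comes for statistics with no simple SE formula.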

Looking Ahead (cont’d)

 | MC Simulation | Bootstrap
Distribution | \(F\) (known, chosen) | \(\hat{F}_n\) (estimated from data)
Draws | \(Y_i^{(b)} \sim F\) | \(Y_i^{*(b)} \sim \hat{F}_n\)
Statistic | \(T_n^{(b)} = T(Y^{(b)})\) | \(T_n^{*(b)} = T(Y^{*(b)})\)
Approximates | \(G_n(\cdot, F)\) | \(G_n(\cdot, \hat{F}_n)\)
  • The bootstrap works because \(\hat{F}_n \to F\), so \(G_n(\cdot, \hat{F}_n) \approx G_n(\cdot, F)\). The computational machinery is identical — only the conceptual justification differs.

Practical Advice

  • Choosing \(B\): for size/coverage, \(B = 10{,}000\) gives MC standard error \(\le 0.005\). For power curves, \(B = 5{,}000\) per point is often sufficient.

  • Choosing \(n\): examine a range of \(n\) to see how finite-sample behavior approaches the asymptotic benchmark.

  • Choosing \(F\): try multiple DGPs — normal errors, skewed errors, heavy tails, heteroskedasticity — to stress-test the method. Let theory guide which features of \(F\) to vary.

  • Report MC standard errors: treat MC estimates as estimates, not exact quantities.
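Since \(\hat{p}(1-\hat{p})/B\) is maximized at \(\hat{p} = 0.5\), the worst-case MC standard error at \(B = 10{,}000\) is exactly 0.005. A one-line helper makes the calculation explicit:

```r
# MC standard error of a simulated proportion (size, coverage, power)
mc_se <- function(p_hat, B) sqrt(p_hat * (1 - p_hat) / B)
mc_se(0.50, 10000)   # worst case: exactly 0.005
mc_se(0.95, 10000)   # near nominal coverage: about 0.0022
```

Reporting, say, "coverage 0.684 (MC SE 0.005)" makes clear that the simulated number is itself an estimate.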

Summary

  • MC simulation lets us evaluate estimators and inference procedures at any finite \(n\) and for any \(F\) we can draw from.

  • The sampling distribution \(G_n(\cdot, F)\) is the central object. MC approximates it via repeated simulation.

  • Theory and MC are complements: theory identifies the key quantities and predicts where problems arise; MC quantifies how severe they are at specific \((n, F)\).

  • The same computational framework — simulate from a known distribution and compute statistics — underlies the bootstrap, which we will study next.